Syllable Reconstruction in Concatenated Waveform Speech Synthesis
Abstract
In general-purpose concatenated waveform synthesis an exhaustive inventory of stored waveforms is needed. Our SPRUCE system is syllable and word based, but for general-purpose work its inventory needs examples of all possible syllables. The high-level synthesis engine used to generate the phonology and prosody of utterances is already general purpose, but its use is constrained by small low-level inventories of re-combinable waveforms. The feasibility study reported here was carried out to determine whether we could take one of the word-based limited-domain versions of the system and make it more general by excising syllables from existing polysyllabic words and recombining them into new words. Initially the study treats temporal rather than spectral considerations.

1. PRELIMINARIES

Concatenated waveform synthesis [1] uses an inventory of stored waveforms. This paper reports experiments in enlarging MeteoSPRUCE – a weather forecasting application of our general-purpose high-level TTS engine SPRUCE [2] – to widen its usability without the need for re-recording [3] [4] [5] [6] [7]. Before embarking on the task of excising and recombining we needed to be clear on a number of basic theoretical points:

• Phonological symbolic representations [8] are of limited use for identifying syllables in the waveform. The phonological concept of a boundary carries over uneasily to the waveform.
• Phonetic representations [9] are also symbolic, and although we can identify an allophone string corresponding to a phonological syllable there is still often no clear feature for acoustically delimiting syllables.
• The notion of a boundary as a point for cutting a waveform is misleading. Acoustic syllables often overlap, telescope or merge, and one syllable may ‘begin’ before the previous one has ‘ended’; that is, the time allocated to a sequenced pair of syllables is not always the sum of the individual times.
• Coarticulation [10] or coproduction [10] [11], which is responsible for temporal overlap, is also responsible for spectral overlap. Even if cuts are made at the ‘right’ places, spectral boundary effects from both syllables are carried along when they are recombined in new but ‘wrong’ contexts.

2. A SIMPLE EXAMPLE

The 2000-word MeteoSPRUCE database includes waveforms of the words unsettled and likely. We try using these to create a new word unlikely, i.e. to detach the syllable un and place it in front of the like syllable of likely. Phonetic syllable boundaries are marked in the database morphemically where possible, otherwise phonologically. Fig.1 shows the database entries.

Fig.1 unsettled and likely in the MeteoSPRUCE database.

By cutting unsettled at the end of the last pitch period of un we can paste the beginning of the file onto the start of likely to produce a new reconstructed word object *unlikely. Fig.2 compares the result of conjoining the syllables with a recording of unlikely, which on this occasion is in the database.

Fig.2 Reconstruction of *unlikely, and the recorded waveform of unlikely in MeteoSPRUCE.

The degree of coproduction between syllables is context dependent – we deliberately picked the syllable un in unsettled because it showed the minimum of ‘telescoping’ coproduction.

Fig.3 Reconstruction of *unlikely using the derived synthetic syllable un and the recorded word likely (also normalised at the beginning of the word to form the synthetic syllable like).

page 2303 ICPhS99 San Francisco

So far, we have identified three stages in the reconstruction procedure:
a. phonetic syllable excision,
b. normalisation,
c. synthetic syllable conjoining.

There are errors in the reconstruction shown in Fig.2, and the transition between the syllables un and like appears protracted and awkwardly joined. An improvement (Fig.3) is obtained by a normalising procedure dealing with syllable overlap.
The procedure involves setting up a synthetic syllable, derived in the normalisation process from the phonetic syllable.

3. IDENTIFYING AND DESCRIBING SYLLABLES

To clarify the concept of recovery: it may be possible to excise a stretch of waveform of the right length from a suitable word, but because of coproduction effects it is unlikely to be directly reusable except in a similar context. Recovery means excision and reconstruction. The excised stretch of waveform – the phonetic syllable – is going to be used as the basis for reconstructing the desired waveform – the synthetic syllable. The procedure we have developed for syllable recovery calls for syllable models defined on three different levels.

Phonological syllable – a unit higher than the ‘sound’ segment [12]. Introduced to form a framework for characterising the sequencing of simple segments, it provides the primary unit for modelling prosody. Phonetic detail is irrelevant at this level: nonlinear organisation into syllabic units is important. We characterise phonological syllables as in linguistics [13]. In our model the phonological syllable figures prominently because it enables direct reference to a listener’s perception of ‘sound’ sequencing – the phonological syllable characterises for us the result of successful perception. Since our synthesis philosophy revolves around satisfying a listener’s perceptual abilities we need a level specifically designed to capture this. So, listeners identify a unit at the beginning of unsettled, pronounce it in isolation and tell us that it is the same as a unit identified at the start of the word unlikely. This cognitive similarity is not the same as acoustic similarity – coarticulatory phenomena constrain the two uns to be systematically different acoustically. The goal of the reconstruction procedure is to use a portion of the waveform of unsettled to change likely into a correctly perceived new word unlikely.
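As a rough illustration of stages a and c from Section 2 (excision and conjoining), the sketch below treats a waveform as a plain list of samples. The sample values, the cut index and the linear cross-fade are assumptions made for the example only; in particular the cross-fade is a stand-in for overlap handling and is not the normalisation procedure used in SPRUCE, whose details are not given here. It shows only the temporal point that an overlapped pair is shorter than the sum of its parts.

```python
from typing import List, Sequence


def excise(word: Sequence[float], cut: int) -> List[float]:
    """Stage a: keep the samples before `cut` (e.g. the end of the last pitch period of 'un')."""
    if not 0 <= cut <= len(word):
        raise ValueError("cut index lies outside the waveform")
    return list(word[:cut])


def conjoin(first: Sequence[float], second: Sequence[float], overlap: int = 0) -> List[float]:
    """Stage c: join two syllables.

    The last `overlap` samples of `first` are cross-faded linearly into the
    first `overlap` samples of `second` (an assumed, hypothetical scheme).
    """
    if overlap < 0 or overlap > min(len(first), len(second)):
        raise ValueError("overlap must fit inside both syllables")
    head = list(first[:len(first) - overlap])
    tail = list(second[overlap:])
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of `second` rises across the overlap
        a = first[len(first) - overlap + i]
        b = second[i]
        mixed.append((1.0 - w) * a + w * b)
    return head + mixed + tail


# Toy stand-ins for the stored waveforms (hypothetical sample values).
unsettled = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
likely = [1.0, 1.0, 1.0, 1.0]

un = excise(unsettled, cut=4)                 # hypothetical cut index
plain = conjoin(un, likely)                   # simple paste: durations add
telescoped = conjoin(un, likely, overlap=2)   # overlapped: shorter than the sum
```

With these toy values `plain` has 8 samples and `telescoped` has 6, i.e. the pair's duration is the sum of the two syllable durations minus the overlap.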
Phonetic syllable – a descriptive unit characterising part of a human acoustic signal prompting a listener to identify a phonological syllable. This is the level at which distinguishing acoustic features, as well as other acoustic features, are identified. The model describes the waveform as in acoustic phonetics [14]. Which ‘sounds’ are sequenced in a phonetic syllable is a phonological rather than phonetic matter in our reconstruction procedure. The phonetic syllable is the waveform which triggers the phonological syllable – and its phonetic description. There has been a lot of discussion concerning the relationship between phonetic and phonological characterisations of the same stretch of speech [15]. The phonetic syllable models the acoustic signal and the phonological syllable models a cognitive response to the signal. The models are linked since they each deal with the same signal. Notice that we use the term phonetic syllable to refer to both a stretch of waveform and its acoustic model.

Synthetic syllable – a model of an acoustic stretch which can be manipulated to trigger in the listener a response of the right phonological syllable. The synthetic syllable may or may not be the same as the phonetic syllable from which it is derived. In SPRUCE a waveform in the database can be a phonetic syllable (modelling the human syllable, e.g. snow), but it is also there as a synthetic syllable – a model for concatenation to produce a new word, e.g. snowing. The synthetic syllable derives from a phonetic entry in the database by a normalisation procedure which varies in complexity depending on syllable type and the environment from which it is to be excised – that is, the normalisation process is both context and type sensitive.

4. SYLLABLE TYPES AND CONTEXTS

We classify syllable types by their phonological start (onset) and end (coda). Initially we were concerned about coarticulatory effects between phonetic syllables, i.e.
that reconstructed words should have the correct temporal and spectral phonetic properties at new syllable boundaries. However, to take full account of all acoustic effects of quality change resulting from coproduction, all combinations would need to be considered. For this initial study we reduced the problem to a working model of temporal syllable combining. Setting aside phonetic quality at syllable boundaries, we concentrated on the temporal properties of onset and offset. Examination of all words in the database revealed that our working model might need to deal only in initial and final segment types, rather than in all possible individual segments that occur. We established segment types according to the usual phonetic parameters [4]. So, all syllables include a vowel segment preceded by up to three phonetic consonants and followed by up to four:
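The structural constraint stated in the last sentence (one vowel segment, at most three consonants before it and at most four after it) can be written as a small check. This is a minimal sketch: the "C"/"V" string encoding, the function name `onset_coda_sizes` and the treatment of a diphthong as a single vowel segment are assumptions for illustration, and the segment types themselves are not modelled.

```python
import re
from typing import Optional, Tuple

# Up to three consonants, exactly one vowel segment, up to four consonants.
# "C" and "V" are a hypothetical encoding of consonant and vowel segments.
SYLLABLE = re.compile(r"(C{0,3})V(C{0,4})")


def onset_coda_sizes(pattern: str) -> Optional[Tuple[int, int]]:
    """Return (onset length, coda length) for a C/V pattern, or None if it
    does not fit the stated syllable shape."""
    match = SYLLABLE.fullmatch(pattern)
    if match is None:
        return None
    return len(match.group(1)), len(match.group(2))


# Assumed encodings: "un" as VC, "like" as CVC (diphthong counted as one V).
un_shape = onset_coda_sizes("VC")
like_shape = onset_coda_sizes("CVC")
too_long = onset_coda_sizes("CCCCV")  # four onset consonants: rejected
```

Grouping syllables by the returned pair, or by the type of the first and last segment, is one way the onset and coda classification of this section could be indexed.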